Ancient Printed Documents Indexation: A New Approach
Abstract
Building on a study of the specific characteristics of historical printed books and of the main sources of error in classical page layout analysis methods, this paper presents a new approach to indexing ancient printed documents. The approach is based on extracting and quantifying the various orientations present in printed document images. Documents are first split into homogeneous areas, in which we analyze the significant orientations using a directional rose. Each kind of information (textual or graphical) is then identified and labelled according to its orientation distribution. This characterization lets us separate textual regions from graphical ones while minimizing a priori knowledge. We evaluate the proposal through document image retrieval based on layout extraction criteria; the method can also be used to precisely localize graphical parts in various types of documents. The system has been tested successfully on several ancient printed books from the Renaissance.
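As a rough illustration of the idea of quantifying orientations per area, the sketch below builds a rose of directions for a single image block. It is not the authors' implementation: it assumes a magnitude-weighted histogram of gradient orientations as a stand-in for the paper's orientation analysis, and the function names, the 36-bin resolution, and the `peak_threshold` labelling rule are all hypothetical choices made for this example.

```python
import numpy as np


def directional_rose(block, n_bins=36):
    """Magnitude-weighted histogram of edge orientations over [0, 180) degrees.

    Assumption: gradient orientations stand in for whatever orientation
    estimator the paper actually uses. Bin k is centered on k * (180 / n_bins).
    """
    gy, gx = np.gradient(np.asarray(block, dtype=float))
    magnitude = np.hypot(gx, gy)
    bin_width = 180.0 / n_bins
    # An edge runs perpendicular to its gradient, hence the 90 degree shift.
    # The half-bin offset centers the bins on 0, bin_width, 2 * bin_width, ...
    theta = (np.degrees(np.arctan2(gy, gx)) + 90.0 + bin_width / 2.0) % 180.0
    hist, _ = np.histogram(theta, bins=n_bins, range=(0.0, 180.0), weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist


def dominant_orientation(rose):
    """Center, in degrees, of the strongest bin of the rose."""
    return float(np.argmax(rose)) * 180.0 / len(rose)


def label_block(rose, peak_threshold=0.5):
    """Toy rule (hypothetical): one strongly dominant direction -> 'text-like'."""
    return "text-like" if rose.max() >= peak_threshold else "graphic-like"


# Synthetic blocks: horizontal bands (a crude proxy for text lines) and noise.
rows = np.arange(64).reshape(-1, 1)
stripes = np.tile(np.sin(2 * np.pi * rows / 8.0), (1, 64))
noise = np.random.default_rng(0).normal(size=(64, 64))

for name, block in (("stripes", stripes), ("noise", noise)):
    rose = directional_rose(block)
    print(name, dominant_orientation(rose), label_block(rose))
```

The single-threshold rule only demonstrates that an orientation distribution can drive a labelling decision. Real text also contains strong vertical strokes, so a practical system would need a richer description of the rose than its peak value.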
Similar resources
Indexation des documents XML : Un DataGuide annoté avec un index de contenu
Indexing in classical information retrieval offers few tools for handling semi-structured documents: document representations in information retrieval were designed for flat, homogeneous documents and are not suited to treating structure and content simultaneously. Several approaches to indexing semi-structured data have been proposed to resolve this new chal...
Morphological Document Recovery in HSI Space
Old documents frequently exhibit digitization errors, uneven backgrounds, bleed-through effects, and similar defects. Motivated by the challenge of improving printed and handwritten text, we developed a new approach based on morphological color operators in the HSI color space. Our approach is composed of a morphological background estimation for foreground/background separation and text segmentation, a back...
A proposition of a robust system for historical document images indexation
Characterizing noisy or ancient documents remains a challenging problem. Many techniques have been proposed to perform feature extraction and image indexation for such documents. Global approaches are generally less robust and less exact than local approaches. We therefore propose in this paper a hybrid system based on a global approach (fractal dimension) and a local one, based on S...
Indexation conceptuelle par propagation. Application à un corpus d'articles scientifiques liés au cancer
Concept-based information retrieval is known to be a powerful and reliable process. However, the need for a semantically annotated corpus and its corresponding data structure (e.g. a domain ontology) can be problematic. The design and enlargement of a semantic index is a tedious task that needs to be addressed. We previously suggested an annotation propagation approach in a vector space rep...
Cleaning of Ancient Document Images Using Modified Iterative Global Threshold
Ancient document image processing is an important area that has attracted many researchers in recent years. Binarization is the first step in cleaning a document for further processing. Depending on the degradation of the original document, either global or local thresholding methods are preferred. Thresholding is a simple and practical approach to identify the cluster of pixels that a...